# write code in "chunks" (and comments too, using the # symbol)
print("Hello World")[1] "Hello World"
NUTRIM microbiome & metabolome workshop
During this workshop, we will write code and notes in Quarto documents.
Write notes in “markdown”, e.g. your intentions, hypotheses, observations, etc.
# write code in "chunks" (and comments too, using the # symbol)
print("Hello World")[1] "Hello World"
Insert a chunk with ctrl/cmd + shift + I
print("Run this one line with ctrl + enter, or cmd + enter on macOS")
print("or run a whole chunk with ctrl/cmd + SHIFT + enter")Run all previous chunks by clicking the first button in the corner of the chunk
It is good practice to load all the packages you need at the top of your notebook.
# `readxl` is for reading data from Excel files
library(readxl)# the `here` package makes specifying file locations easier
library(here)here() starts at /Users/david/Documents/github/my_projects/teaching/workshops/2024-NUTRIM-microbiome
# `tidyverse` is a collection of several packages (dplyr, ggplot2, and others)
library(tidyverse) ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
How do we read a table of data from a file, e.g. an Excel file?
# give your objects short but informative names!
meta <- read_excel(path = here("data/papa2012/papa2012_metadata.xlsx"), na = "NA")
meds <- read_excel(here("data/papa2012/papa2012_metadata.xlsx"), sheet = "treatment")There are multiple ways to tell R where to find the data files for this project.
👍 The best way 👍
For each data analysis project you do, create a separate folder for that project, and keep all relevant code and data inside that folder.
RStudio also offers additional convenient features for organisation with RStudio Projects!
Using the R package called here you can easily specify the location of your data using file paths relative to the project folder. The here package offers one important function, also called here()
# this will work anywhere! (anywhere that the project folder is moved or copied)
meta <- read_excel(path = here("data/papa2012/papa2012_metadata.xlsx"), na = "NA")This is reliable and portable! 🤩 If you share the entire project folder with a collaborator (e.g. as a zip file or via github) then this code will work on their computer without needing any changes!
😐 Easy but limiting 😐
An absolute path (also known as a full path) can be used to specify where a file is on your own computer.
This will work okay, but only if you never move your files, and never change computer…
Using absolute paths makes it inconvenient to share your project or work collaboratively. 🙁
# this works on my machine, but it won't work anywhere else!
meta <- read_excel(path = "/Users/david/Documents/teaching/workshops/2024-NUTRIM-microbiome/data/papa2012/papa2012_metadata.xlsx", na = "NA")🤷 Commonly used but still problematic 🤷
You can use paths relative to the working directory set with setwd("/an/absolute/path/on/your/computer")
BUT:
setwd still uses an absolute path, so it is still not portable!
setwd does not work in notebooks, which reset working directory every chunk
Prof. Jenny Bryan might set your computer on fire! 🖥️🔥 (read why here)
# this works on my machine, but it won't work anywhere else!
setwd("/Users/david/Documents/teaching/workshops/2024-NUTRIM-microbiome")
meta <- read_excel(path = "data/papa2012/papa2012_metadata.xlsx", na = "NA")Look at the metadata.
meta# A tibble: 90 × 10
ID sample case_control diagnosis activity active gender ethnicity
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 099A 099-AX Case UC severe active female White
2 199A 199-AX Control Other control control female Other
3 062B 062-BZ Case CD mild active male White
4 194A 194-AZ Case UC mild active male White
5 166A 166-AX Case UC severe active female Black
6 219A 219-AX Case UC inactive inactive female <NA>
7 132A 132-AX Case CD mild active female White
8 026A 026-AX Case UC mild active male White
9 102A 102-AZ Control Other control control male White
10 140A 140-AX Control Other control control female White
# ℹ 80 more rows
# ℹ 2 more variables: family_history <chr>, age_years <dbl>
A dataframe is the standard class of object for holding rectangular data (tables) in your R environment.
Columns of a dataframe can hold vectors of different classes, e.g. chr and int (characters and integers)
(In contrast, a matrix can only hold one type of data!)
example_df <- data.frame(alphabet = LETTERS, numbers = 1:26)Try View(example_df) in the Console - but not in a Quarto doc chunk
A tibble is just a dataframe, but with a concise print format, useful!
example_tbl <- as_tibble(example_df)
example_tbl# A tibble: 26 × 2
alphabet numbers
<chr> <int>
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
6 F 6
7 G 7
8 H 8
9 I 9
10 J 10
# ℹ 16 more rows
Each column in a dataframe/tibble can be on of the following types of vector object
c(7.5, -2.301, 0.666).c(9L, 4L, -123L).c(TRUE, FALSE).c("no", "yes", "unsure").factor(c("no", "yes", "unsure")).table(meta$family_history, useNA = "ifany")
fhx nofhx <NA>
25 54 11
table(meta$age_years, useNA = "ifany")
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 24
2 3 5 1 2 5 5 4 6 10 6 9 7 7 4 4 5 3 1 1
You can count the number of times a category occurs, similar to the table function.
meta %>% count(diagnosis)# A tibble: 3 × 2
diagnosis n
<chr> <int>
1 CD 23
2 Other 24
3 UC 43
Or count combinations of categories.
meta %>% count(family_history, diagnosis)# A tibble: 9 × 3
family_history diagnosis n
<chr> <chr> <int>
1 fhx CD 8
2 fhx Other 4
3 fhx UC 13
4 nofhx CD 12
5 nofhx Other 14
6 nofhx UC 28
7 <NA> CD 3
8 <NA> Other 6
9 <NA> UC 2
Or compute other summary statistics.
meta %>% summarise(age_mean = mean(age_years), age_sd = sd(age_years))# A tibble: 1 × 2
age_mean age_sd
<dbl> <dbl>
1 12.6 4.65
Or compute grouped summary statistics
meta %>% summarise(age_mean = mean(age_years), age_sd = sd(age_years), .by = diagnosis)# A tibble: 3 × 3
diagnosis age_mean age_sd
<chr> <dbl> <dbl>
1 UC 13.8 4.26
2 Other 9.08 4.30
3 CD 14.1 3.84
R’s base graphics can be used to quickly summarise data distributions.
ggplot2 is a popular and powerful plotting package.
library(patchwork)
plot_list <- list()
# age boxplot
plot_list$box <- meta %>%
ggplot(aes(y = case_control, x = age_years)) +
geom_boxplot(
mapping = aes(fill = case_control),
width = 0.15, staplewidth = 0.5, outliers = FALSE,
position = position_nudge(y = 0.2),
show.legend = FALSE
) +
geom_jitter(height = 0.05, width = 0.1, alpha = 0.8, size = 1) +
labs(y = NULL, x = "Age (years)", title = "Age Distributions") +
theme_bw() +
theme(plot.margin = margin(b = 20, r = 20))
# plot_list$box
# family history piechart
plot_list$pie <- meta %>%
ggplot(aes(y = 1, fill = family_history)) +
geom_bar(position = "stack", colour = "black") +
scale_fill_manual(
values = c(fhx = "grey30", nofhx = "grey70"),
na.value = "white", guide = "none"
) +
annotate("text", x = I(pi/8), y = I(0.25), label = "NA") +
annotate("text", x = I(0.9 * pi), y = I(0.25), label = "No FHx") +
annotate("text", x = I(1.7 * pi), y = I(0.25), label = "FHx", colour = "white") +
labs(tag = "Family History") +
labs(x = NULL, y = NULL) +
coord_radial(expand = FALSE, inner.radius = 0.3) +
theme_void() +
theme(plot.tag.location = "panel", plot.margin = margin(t = 10))
# plot_list$pie
# activity barchart
plot_list$bar <- meta %>%
ggplot(aes(
x = diagnosis,
fill = factor(activity, c("severe", "moderate", "mild", "inactive"))
)) +
geom_bar(colour = "black", linewidth = 0.2) +
scale_x_discrete(limits = c("UC", "CD")) +
scale_fill_brewer(name = "Activity", palette = "Reds", direction = -1) +
labs(title = "Activity Level", x = NULL, y = NULL) +
coord_cartesian(ylim = c(0, NA), expand = FALSE) +
theme_classic() +
theme(
plot.margin = margin(l = 20),
legend.key.height = unit(1, "cm"), legend.key.width = unit(3, "mm")
)
# plot_list$bar
# assemble with patchwork
wrap_plots(
A = plot_list$box, C = plot_list$bar,
B = plot_list$pie, guides = "collect",
design =
"AAC
AAC
BBC
BBC
BBC"
)Links to good resources for learning ggplot2
R 4 Data Science book - intro to plots in R
R Graphics Cookbook - quick practical guide
R graph gallery - ideas and example code
ggplot2 website - a reference manual
ggplot2 book - thorough grounding
Cedric Scherer’s blog - tutorial/examples
Metaprogramming, Advanced R - experts only 🤯
Often, your data are not all in one table. For example, there were two sheets of data in the metadata Excel file, which we stored in the dataframe objects meta and meds
The main dataframe meta contains most of the data about each patient:
meta# A tibble: 90 × 10
ID sample case_control diagnosis activity active gender ethnicity
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 099A 099-AX Case UC severe active female White
2 199A 199-AX Control Other control control female Other
3 062B 062-BZ Case CD mild active male White
4 194A 194-AZ Case UC mild active male White
5 166A 166-AX Case UC severe active female Black
6 219A 219-AX Case UC inactive inactive female <NA>
7 132A 132-AX Case CD mild active female White
8 026A 026-AX Case UC mild active male White
9 102A 102-AZ Control Other control control male White
10 140A 140-AX Control Other control control female White
# ℹ 80 more rows
# ℹ 2 more variables: family_history <chr>, age_years <dbl>
The second dataframe meds contains medications info for the IBD cases:
meds# A tibble: 67 × 4
ID diagnosis immunosuppression_level medication
<chr> <chr> <chr> <chr>
1 048A UC level0 mesalamine only
2 119A UC level0 mesalamine only
3 120D UC level0 mesalamine only
4 192A UC level0 mesalamine only
5 215A UC level0 mesalamine only
6 216A UC level0 mesalamine only
7 233A CD level0 mesalamine only
8 009A UC level1 steroids
9 035A UC level1 steroids
10 038A UC level1 steroids
# ℹ 57 more rows
dplyr provides functions to “join” dataframes together, using shared variables.
all_meta <- left_join(meta, meds)Joining with `by = join_by(ID, diagnosis)`
By default the join function will perform a “natural” join using all shared variables. For greater control you can specify a “key” variable, or set of variables, that should be used.
# this should do the same thing as the natural join shown above
all_meta <- left_join(meta, meds, by = join_by(ID, diagnosis))What happened with the Control group? (remember they were not present in meds!)
Inspect the all_meta dataframe to find out! e.g. View(all_meta)
A natural left join is the most common join you will need.
But there are many other useful possibilities, e.g.
Learn more at:
Often, you need to modify your variables, or create new ones.
For simple transformations you can easily do this with base R.
# create a logical variable, TRUE if patient has family history of IBD
all_meta$ibd_fhx <- all_meta$family_history == "fhx"
# always check the result is what you expected!
all_meta[, c("ID", "family_history", "ibd_fhx")]# A tibble: 90 × 3
ID family_history ibd_fhx
<chr> <chr> <lgl>
1 099A nofhx FALSE
2 199A <NA> NA
3 062B nofhx FALSE
4 194A nofhx FALSE
5 166A <NA> NA
6 219A nofhx FALSE
7 132A nofhx FALSE
8 026A nofhx FALSE
9 102A fhx TRUE
10 140A nofhx FALSE
# ℹ 80 more rows
The mutate function from dplyr is great for making multiple or complex transformations. You refer to variables without repeating the name of the dataframe. It is as if you are working “inside” the dataframe.
# this is equivalent to the previous block
all_meta <- all_meta %>% mutate(ibd_fhx = family_history == "fhx")
# check the result again
all_meta %>% select(ID, family_history, ibd_fhx)# A tibble: 90 × 3
ID family_history ibd_fhx
<chr> <chr> <lgl>
1 099A nofhx FALSE
2 199A <NA> NA
3 062B nofhx FALSE
4 194A nofhx FALSE
5 166A <NA> NA
6 219A nofhx FALSE
7 132A nofhx FALSE
8 026A nofhx FALSE
9 102A fhx TRUE
10 140A nofhx FALSE
# ℹ 80 more rows
We can convert character string variables to factor variables to specify an ordering of their levels (e.g. for plotting).
You can temporarily mutate a dataframe and use the result. This is easy with pipes.
all_meta %>%
mutate(diagnosis = factor(diagnosis, levels = c("CD", "UC", "Other"))) %>%
ggplot(aes(diagnosis, age_years)) + geom_boxplot()Notice this did NOT persistently modify the diagnosis variable, it is still “character” class.
class(all_meta$diagnosis) # no persistent changes, diagnosis is still character![1] "character"
We need to clean up the medication history.
table(all_meta$medication, useNA = "if")
abx abx, imsp imsp mesalamine only
1 3 14 7
steroids steroids, abx steroids, abx, imsp steroids, imsp
12 4 11 11
<NA>
27
table(all_meta$immunosuppression_level, useNA = "if")
level0 level1 level2 level3 level4 none <NA>
7 16 18 11 10 4 24
We have NAs for the medication for all the Controls. We know they had “none”, so let’s first indicate that. We will also replace Controls’ immunosuppression_level NA values with “none”.
all_meta <- all_meta %>% mutate(
medication = if_else(case_control == "Control", true = "none", false = medication),
immunosuppression_level = if_else(case_control == "Control", true = "none", false = immunosuppression_level)
)
# check the result!
all_meta %>% select(ID, case_control, medication, immunosuppression_level)# A tibble: 90 × 4
ID case_control medication immunosuppression_level
<chr> <chr> <chr> <chr>
1 099A Case steroids, imsp level3
2 199A Control none none
3 062B Case steroids, imsp level3
4 194A Case steroids level1
5 166A Case abx none
6 219A Case abx, imsp level3
7 132A Case steroids, imsp level2
8 026A Case steroids, abx, imsp level4
9 102A Control none none
10 140A Control none none
# ℹ 80 more rows
We now want logical variables indicating if the patient recently had antibiotics (abx), steroids, or other immunosuppressive drugs (imsp).
all_meta <- all_meta %>% mutate(
abx = str_detect(medication, "abx"),
steroids = str_detect(medication, "steroids"),
imsp_other = str_detect(medication, "imsp"),
imsp_any = steroids | imsp_other
)
# check the result!
# all_meta %>% select(case_control, medication, abx, steroids, imsp_other, imsp_any)For the character variables with more than two values, we can convert them to factors, to encode our preferred ordering of their levels.
all_meta <- all_meta %>% mutate(
diagnosis = factor(diagnosis, c("CD", "UC", "Other")),
active = factor(active, c("control", "inactive", "active")),
activity = factor(activity, c("control", "inactive", "mild", "moderate", "severe"))
)
# you do the same for immunosuppression_level! don't forget to check the resultWe have recapped some R fundamentals and introduced the sample metadata.
Next we will start working with microbiome data!
Click here: LINK TO NEXT PAGE
For reproducibility, it is useful to record the packages and versions used in your analyses. This is easy to do with sessioninfo::session_info().
sessioninfo::session_info()─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31)
os macOS Sonoma 14.5
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Amsterdam
date 2024-05-22
pandoc 3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.3.0)
cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.1)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
digest 0.6.35 2024-03-11 [1] CRAN (R 4.3.1)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.1)
evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.1)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.1)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.3.1)
glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.1)
gtable 0.3.5 2024-04-22 [1] CRAN (R 4.3.1)
here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.3.1)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.3.1)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.1)
knitr 1.45 2023-10-30 [1] CRAN (R 4.3.1)
labeling 0.4.3 2023-08-29 [1] CRAN (R 4.3.0)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1)
lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.1)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
munsell 0.5.1 2024-04-01 [1] CRAN (R 4.3.1)
patchwork * 1.2.0 2024-01-08 [1] CRAN (R 4.3.1)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.3.0)
readr * 2.1.5 2024-01-10 [1] CRAN (R 4.3.1)
readxl * 1.4.3 2023-07-06 [1] CRAN (R 4.3.0)
rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.1)
rmarkdown 2.26 2024-03-05 [1] CRAN (R 4.3.1)
rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.3.1)
rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.3.1)
scales 1.3.0 2023-11-28 [1] CRAN (R 4.3.1)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
stringi 1.8.4 2024-05-06 [1] CRAN (R 4.3.1)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.1)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.3.1)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.3.1)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)
timechange 0.3.0 2024-01-18 [1] CRAN (R 4.3.1)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.1)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.1)
withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.1)
xfun 0.43 2024-03-25 [1] CRAN (R 4.3.1)
yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.1)
[1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
──────────────────────────────────────────────────────────────────────────────